We show why the amount of information communicated between
the past and future (the excess entropy) is not in general the
amount of information stored in the present (the statistical
complexity). This is a puzzle, and a long-standing one, since
the latter is what is required for optimal prediction, but the
former describes observed behavior. We lay out a classification
scheme for dynamical systems and stochastic processes that
determines when these two quantities are the same or different.
We do this by developing closed-form expressions for the excess
entropy in terms of optimal causal predictors and retrodictors:
the ε-machines of computational mechanics. A process's causal
irreversibility and crypticity are key determining properties.

Constructing a theory can be viewed as our attempt to
extract from measurements a system's hidden organization.
This suggests a parallel with cryptography, whose
goal [1] is to not reveal internal correlations within an
encrypted data stream, even though it contains, in fact,
a message. This is essentially the circumstance that
confronts a scientist when building a model for the first time.

In this view, the now-long history in nonlinear dynamics
of reconstructing models from time series concerns
the case of self-decoding, in which the information used
to build a model is only that available in the observed
process. That is, no "side-band" communication, prior
knowledge, or disciplinary assumptions are allowed. Nature
speaks for herself only through the data she willingly
gives up.

Here we show that the parallel is more than metaphor: 
building a model corresponds directly to decrypting the 
hidden state information in measurements. The results 
show why predicting and modeling are, at one and the 
same time, distinct and intimately related. Along the 
way, a number of persistent confusions about the role 
of (and different kinds of) information in prediction and 
modeling are clarified. We show how to measure the
degree of hidden information and identify
a new kind of statistical irreversibility that plays a key
role.

Any process $P(\overleftarrow{X}, \overrightarrow{X})$ is a communication
channel: It transmits information from the past
$\overleftarrow{X} = \ldots X_{-3} X_{-2} X_{-1}$ to the future
$\overrightarrow{X} = X_0 X_1 X_2 \ldots$ by storing it in the present.
Here $X_t$ is the random variable for the measurement outcome at
time $t$. Our goal is also simply stated: We wish to predict the
future using information from the past. At root, a prediction is
probabilistic, specified by a distribution of possible futures
$\overrightarrow{X}$ given a particular past $\overleftarrow{x}$:
$P(\overrightarrow{X} | \overleftarrow{x})$. At a minimum, a good
predictor needs to capture all of the information $I$ shared
between past and future: $\mathbf{E} = I[\overleftarrow{X}; \overrightarrow{X}]$,
the process's excess entropy [4, and references therein].

Consider now the goal of modeling: to build a representation
that not only allows good prediction, but also
expresses the mechanisms that produce a system's behavior.
To build a model of a structured process (a channel),
computational mechanics [5] introduced an equivalence
relation $\overleftarrow{x} \sim \overleftarrow{x}'$ to group all histories that give rise to
the same prediction, resulting in a map from pasts to the
causal states:
$\epsilon(\overleftarrow{x}) = \{\overleftarrow{x}' : P(\overrightarrow{X} | \overleftarrow{x}) = P(\overrightarrow{X} | \overleftarrow{x}')\}$.
A process's causal states, $\mathcal{S} = P(\overleftarrow{X}, \overrightarrow{X})/\sim$, partition the
space $\overleftarrow{\mathbf{X}}$ of pasts into sets that are predictively
equivalent. The set of causal states can be discrete,
fractal, or continuous. State-to-state transitions are denoted
by matrices $T^{(x)}_{\mathcal{S}\mathcal{S}'}$, whose elements give the probability of
transitioning from one state $\mathcal{S}$ to the next $\mathcal{S}'$ on seeing
measurement value $x$. The resulting model, consisting of
the causal states and transitions, is called the process's
ε-machine.

Causal states have the Markovian property that they
render the past and future statistically independent;
they shield the future from the past [5]:
$P(\overleftarrow{X}, \overrightarrow{X} | \mathcal{S}) = P(\overleftarrow{X} | \mathcal{S}) P(\overrightarrow{X} | \mathcal{S})$.
In this way, the causal states give a
structural decomposition of the process into conditionally
independent modules. Moreover, they are optimally
predictive in the sense that knowing which causal state
a process is in is just as good as having the entire past:
$P(\overrightarrow{X} | \mathcal{S}) = P(\overrightarrow{X} | \overleftarrow{X})$.
In other words, causal shielding is
equivalent to the fact that the causal states capture
all of the information shared between past and future:
$I[\mathcal{S}; \overrightarrow{X}] = \mathbf{E}$.

Out of all optimally predictive models $\widehat{\mathcal{R}}$, those for which
$I[\widehat{\mathcal{R}}; \overrightarrow{X}] = \mathbf{E}$, the ε-machine captures the minimal
amount of information that a process must store in order
to communicate all of the excess entropy from the
past to the future. This is the statistical complexity:
$C_\mu = H[\mathcal{S}] \le H[\widehat{\mathcal{R}}]$. In short, $\mathbf{E}$ is the information
transmission rate of the process, viewed as a channel, and
$C_\mu$ is the sophistication of that channel.

In addition to $\mathbf{E}$ and $C_\mu$, another key (and
historically prior) invariant for dynamical systems and
stochastic processes is the entropy rate $h_\mu$, which is the
per-measurement rate at which the process generates
information: its degree of intrinsic randomness. Importantly,
the ε-machine immediately gives two of these
three important invariants: a process's rate ($h_\mu$) of
producing information and the amount ($C_\mu$) of historical
information it stores in doing so.

To date, $\mathbf{E}$ could not be calculated or estimated as
directly as the entropy rate and the statistical complexity.
This is truly unfortunate, since excess entropy, and
related mutual information quantities, are widely used
diagnostics for processes, having been applied to detect the
presence of organization in dynamical systems,
in spin systems, in neurobiological systems [11, 12],
and even in language, to mention only a few applications.
For example, in natural language the excess entropy
appears to diverge as $\mathbf{E} \propto L^{1/2}$, reflecting the long-range
and strongly nonergodic organization necessary for human
communication [13, 14].

This state of affairs has been a major impediment to
understanding the relationships between modeling and
predicting and, more concretely, the relationships
between (and even the interpretation of) a process's basic
invariants: $h_\mu$, $C_\mu$, and $\mathbf{E}$. Here we clarify these issues
by deriving explicit expressions for $\mathbf{E}$ in terms of the
ε-machine, providing a unified information-theoretic
analysis of general processes.

The above development of ε-machines concerns using
the past to predict the future. But what about
retrodiction: using the future to retrodict the past? Usually,
one thinks of successive measurements occurring as time
increases. Now, consider scanning the measurement
variables not in the forward time direction, but in the
reverse. The computational mechanics formalism is
essentially unchanged, though its meaning and notation need
to be augmented.

With this in mind, the previous mapping from pasts
to causal states is denoted $\epsilon^+$ and it gave what we
will call the predictive causal states $\mathcal{S}^+$. When
scanning in the reverse direction, we have a new relation,
$\overrightarrow{x} \sim^- \overrightarrow{x}'$, which groups futures that are equivalent for
the purpose of retrodicting the past:
$\epsilon^-(\overrightarrow{x}) = \{\overrightarrow{x}' : P(\overleftarrow{X} | \overrightarrow{x}) = P(\overleftarrow{X} | \overrightarrow{x}')\}$.
It gives the retrodictive causal
states $\mathcal{S}^- = P(\overleftarrow{X}, \overrightarrow{X})/\sim^-$. And, not surprisingly, we
must also distinguish a process's forward-scan ε-machine
$M^+$ from its reverse-scan ε-machine $M^-$. They assign
corresponding entropy rates, $h_\mu^+$ and $h_\mu^-$, and statistical
complexities, $C_\mu^+ = H[\mathcal{S}^+]$ and $C_\mu^- = H[\mathcal{S}^-]$,
respectively, to the process.

Now we are in a position to ask some questions. Perhaps
the most obvious is: In which time direction is a
process most predictable? The answer is that a stationary
process is equally predictable in either: $h_\mu^- = h_\mu^+$.
Somewhat surprisingly, though, the effort involved in doing
so is not the same [3]: $C_\mu^- \neq C_\mu^+$. Naturally, $\mathbf{E}$
is mute on this score, since the mutual information $I$ is
symmetric in its variables [4].

The relationship between predicting and retrodicting
a process, and ultimately $\mathbf{E}$'s role, requires teasing out
how the states of the forward and reverse ε-machines
capture information from the past and the future. To do
this we must analyze a four-variable mutual information:
$I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+; \mathcal{S}^-]$. A large number of expansions of this
quantity are possible. A systematic development follows
from Ref. [16], which showed that Shannon entropy $H[\cdot]$
and mutual information $I[\cdot\,;\cdot]$ form a measure over the
space of events.




FIG. 1: ε-Machine information diagram for stationary hidden
stochastic processes.

Using an information diagram expansion, it turns
out there are 15 possible relationships to consider for
$I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+; \mathcal{S}^-]$. Fortunately, this greatly simplifies in
the case of using an ε-machine to represent a process:
There are only five relationships. (See Fig. 1.) Simplified
in this way, we are left with our main results which,
due to the preceding effort, are particularly transparent.

Theorem 1. Excess entropy is the mutual information
between the predictive and retrodictive causal states:

$$\mathbf{E} = I[\mathcal{S}^+; \mathcal{S}^-] . \tag{1}$$

Notably, the process's channel capacity $\mathbf{E} = I[\overleftarrow{X}; \overrightarrow{X}]$ is
the same as that of the "channel" between the forward
and reverse ε-machine states. Moreover, the predictive
statistical complexity is given by $C_\mu^+ = \mathbf{E} + H[\mathcal{S}^+ | \mathcal{S}^-]$
and the retrodictive statistical complexity by
$C_\mu^- = \mathbf{E} + H[\mathcal{S}^- | \mathcal{S}^+]$.

Theorem 1 and its two companion results give an
explicit connection between a process's excess entropy and
its causal structure: its ε-machines. More generally, the
relationships directly tie mutual information measures
of observed sequences to a process's structure. They
will allow us to probe the properties that control how
closely observed statistics reflect a process's internal
hidden structure; that is, the degree to which observed
behavior directly reflects internal state information.

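At this point we have two separate ε-machines, one
for predicting and one for retrodicting. We will now
show that one can do better by combining causal
information from the past and future. Consider scanning
a realization, $\overleftrightarrow{x} = \overleftarrow{x}_t \overrightarrow{x}_t$, of the process in the forward
direction, seeing histories $\overleftarrow{x}_t$ and noting the series of
causal states $\mathcal{S}_t^+ = \epsilon^+(\overleftarrow{x}_t)$. Now change direction. What
reverse causal state is one in? This is $\mathcal{S}_t^- = \epsilon^-(\overrightarrow{x}_t)$. We
describe the process of changing scan direction with the
bidirectional machine $M^\pm$, which is given by the
equivalence relation $\sim^\pm$:

$$\epsilon^\pm(\overleftrightarrow{x}) = \{(\overleftarrow{x}', \overrightarrow{x}') : \overleftarrow{x}' \in \epsilon^+(\overleftarrow{x}) \text{ and } \overrightarrow{x}' \in \epsilon^-(\overrightarrow{x})\}$$

and has causal states $\mathcal{S}^\pm = P(\overleftarrow{X}, \overrightarrow{X})/\sim^\pm \subseteq \mathcal{S}^+ \times \mathcal{S}^-$.
That is, the bidirectional causal state the process is in at
time $t$ is $\mathcal{S}_t^\pm = (\epsilon^+(\overleftarrow{x}_t), \epsilon^-(\overrightarrow{x}_t))$. The amount of stored
information needed to optimally predict and retrodict a
process is $M^\pm$'s statistical complexity:
$C_\mu^\pm = H[\mathcal{S}^\pm] = H[\mathcal{S}^+, \mathcal{S}^-]$.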

From the immediately preceding results we obtain the
following simple, useful relationship:
$\mathbf{E} = C_\mu^+ + C_\mu^- - C_\mu^\pm$.
This suggests a wholly new interpretation of the excess
entropy, in addition to the original three reviewed in
Ref. [4]: $\mathbf{E}$ is exactly the difference between these
statistical complexities. Moreover, only when $\mathbf{E} = 0$ does
$C_\mu^\pm = C_\mu^+ + C_\mu^-$. The bidirectional machine is also
efficient: $C_\mu^\pm \le C_\mu^+ + C_\mu^-$. And we have the bounds:
$C_\mu^+ \le C_\mu^\pm$ and $C_\mu^- \le C_\mu^\pm$. These results say that taking
into account causal information from the past and the
future is more efficient than ignoring one or the other
and than ignoring their relationship.
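
The relationship and the bounds around it follow from standard entropy identities. The following is a short derivation sketch, assuming only Theorem 1 ($\mathbf{E} = I[\mathcal{S}^+; \mathcal{S}^-]$) and the definitions $C_\mu^+ = H[\mathcal{S}^+]$, $C_\mu^- = H[\mathcal{S}^-]$, and $C_\mu^\pm = H[\mathcal{S}^+, \mathcal{S}^-]$:

```latex
\begin{align*}
C_\mu^\pm &= H[\mathcal{S}^+, \mathcal{S}^-] \\
          &= H[\mathcal{S}^+] + H[\mathcal{S}^-] - I[\mathcal{S}^+; \mathcal{S}^-] \\
          &= C_\mu^+ + C_\mu^- - \mathbf{E} , \\[4pt]
\mathbf{E} \ge 0
          &\;\Longrightarrow\; C_\mu^\pm \le C_\mu^+ + C_\mu^- ,
            \quad \text{with equality iff } \mathbf{E} = 0 , \\
C_\mu^\pm &= C_\mu^+ + H[\mathcal{S}^- | \mathcal{S}^+] \ge C_\mu^+ ,
\qquad
C_\mu^\pm = C_\mu^- + H[\mathcal{S}^+ | \mathcal{S}^-] \ge C_\mu^- .
\end{align*}
```

The last line uses only the chain rule for joint entropy and the nonnegativity of conditional entropy for discrete states.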

We noted above that predicting and retrodicting may
require different amounts of information storage
($C_\mu^+ \neq C_\mu^-$). It is helpful to use causal irreversibility to measure
this asymmetry: $\Xi = C_\mu^+ - C_\mu^-$. With the above
results, however, we see that
$\Xi = H[\mathcal{S}^+ | \mathcal{S}^-] - H[\mathcal{S}^- | \mathcal{S}^+]$.
Note that irreversibility is also not controlled by $\mathbf{E}$, as the
latter is scan-symmetric.

The relationship between excess entropy and statistical
complexity established by Thm. 1 indicates that there
are fundamental limitations on the amount of a process's
stored information ($C_\mu^\pm$) directly present in observations
($\mathbf{E}$). We now introduce a measure of this: A process's
crypticity is $d(M^+, M^-) = H[\mathcal{S}^+ | \mathcal{S}^-] + H[\mathcal{S}^- | \mathcal{S}^+]$. This
is the distance between a process's forward and reverse
ε-machines and expresses most explicitly the difference
between prediction and modeling. To see this, we need
the following connection.



Corollary 1. $M^\pm$'s statistical complexity is:

$$C_\mu^\pm = \mathbf{E} + d(M^+, M^-) . \tag{2}$$



Referring to $d$ as crypticity derives from this result: It
is the amount of internal state information ($C_\mu^\pm$) not
directly present in the observed sequence ($\mathbf{E}$). That is, a
process hides $d$ bits of information.

If crypticity is low ($d \approx 0$), then much of the stored
information is present in observed behavior: $\mathbf{E} \approx C_\mu^\pm$.
However, when a process's crypticity is high, $d \approx C_\mu^\pm$,
then little of its structural information is directly present
in observations. Moreover, there are truly cryptic
processes ($\mathbf{E} \approx 0$) that are highly structured ($C_\mu^\pm \gg 0$).
Little or nothing can be learned from measurements about
such processes' hidden organization.




FIG. 2: Forward and reverse ε-machines for the RIP: $M^+$ (a)
and $M^-$ (b). Edge labels $t|x$ give the transition probabilities
$t = T^{(x)}_{\mathcal{S}\mathcal{S}'}$. The bidirectional machine $M^\pm$ (c) for $p = q = 1/2$.
Edge labels here prepend the scan direction $\{-,+\}$.

The ε-machine information diagram of Fig. 1
encapsulates all of these results concisely. The diagram
shows the key relationships between information production
($H[\overleftarrow{X} | \mathcal{S}^-]$ and $H[\overrightarrow{X} | \mathcal{S}^+]$), excess entropy
($\mathbf{E} = I[\overleftarrow{X}; \overrightarrow{X}]$), and stored information ($C_\mu^+$ and $C_\mu^-$).
Analyzing the 4-variable information diagram showed that
there are only four convex sets of interest. These are
depicted as differently shaded ellipses. $H[\overleftarrow{X}]$ and $H[\overrightarrow{X}]$
(two largest ellipses) are the entropies of the past and
future, respectively, which are the process's total information
production. The information stored in the predictive
ε-machine $M^+$ is its statistical complexity: $C_\mu^+ = H[\mathcal{S}^+]$
(small ellipse on left); likewise for $M^-$, $C_\mu^- = H[\mathcal{S}^-]$
(small ellipse on right). The excess entropy $\mathbf{E}$ is the
intersection of these sets, while the statistical complexity
$C_\mu^\pm$ of the bidirectional machine is their union; the
crypticity $d(M^+, M^-)$, their symmetric difference; and
their signed difference, the causal irreversibility $\Xi$.

Consider an example that illustrates the typical
process: cryptic and causally irreversible. This is the
random insertion process (RIP), which generates a random
bit with bias $p$. If that bit was a 1, then it outputs
another 1. If the random bit was a 0, however, it inserts
another random bit with bias $q$ followed by a 1.
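
The verbal definition can be turned into a small sampler. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "bias $p$" means the random bit is 0 with probability $p$ (and likewise for $q$), a reading chosen because it reproduces the state probabilities quoted for the RIP.

```python
import random

def rip_sequence(n_words, p=0.5, q=0.5, seed=0):
    """Sample n_words RIP words and return the concatenated list of bits.

    Assumption (not stated explicitly in the text): "bias p" is read as
    Pr(bit = 0) = p, and "bias q" as Pr(inserted bit = 0) = q.
    """
    rng = random.Random(seed)
    bits = []
    for _ in range(n_words):
        first = 0 if rng.random() < p else 1
        bits.append(first)
        if first == 1:
            bits.append(1)               # a 1 is followed by another 1
        else:
            inserted = 0 if rng.random() < q else 1
            bits.extend([inserted, 1])   # a 0 is followed by a random bit, then a 1
    return bits

def parse_words(bits):
    """Split a RIP sequence that starts at a word boundary back into words."""
    words, i = [], 0
    while i < len(bits):
        length = 2 if bits[i] == 1 else 3
        words.append(tuple(bits[i:i + length]))
        i += length
    return words

seq = rip_sequence(1000)
words = parse_words(seq)
print(len(seq) / len(words))  # average word length, close to p + 2 under this reading
```

Every sampled word is one of `(1, 1)`, `(0, 0, 1)`, or `(0, 1, 1)`, so the only randomness an observer sees is the first bit of each word and, after a 0, the inserted bit.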

Its forward ε-machine, see Fig. 2(a), has three recurrent
causal states $\mathcal{S}^+ = \{A, B, C\}$ and the transition
matrices given there. Figure 2(b) gives $M^-$, which has
four recurrent causal states $\mathcal{S}^- = \{D, E, F, G\}$. We
see that the ε-machines are not the same and so the
RIP is causally irreversible. A direct calculation gives
$P(\mathcal{S}^+) = P(A, B, C) = (1, p, 1)/(p + 2)$ and
$P(\mathcal{S}^-) = P(D, E, F, G) = (1, 1 - pq, pq, p)/(p + 2)$. If $p = q = 1/2$,
for example, these give us $C_\mu^+ \approx 1.5219$ bits, $C_\mu^- \approx 1.8464$
bits, and $h_\mu = 3/5$ bits per measurement. The causal
irreversibility has magnitude $|\Xi| = |C_\mu^+ - C_\mu^-| \approx 0.3245$ bits.
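
The forward-machine numbers can be checked from labeled transition matrices. The sketch below is not the authors' construction: the three-state topology is an assumption inferred from the verbal description of the RIP and from the stated $P(\mathcal{S}^+)$ (with "bias $p$" read as the probability of a 0), and all function names are hypothetical.

```python
from math import log2

def forward_rip_machine(p, q):
    """Labeled matrices T[x][i][j] = Pr(emit x and go to j | state i).

    States are ordered (A, B, C). Assumed topology, not listed in the text:
    A --0 (p)--> B, A --1 (1-p)--> C, B --0 (q) or 1 (1-q)--> C, C --1--> A.
    """
    T0 = [[0.0, p,   0.0],
          [0.0, 0.0, q],
          [0.0, 0.0, 0.0]]
    T1 = [[0.0, 0.0, 1 - p],
          [0.0, 0.0, 1 - q],
          [1.0, 0.0, 0.0]]
    return {0: T0, 1: T1}

def entropy(dist):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(x * log2(x) for x in dist if x > 0)

def stationary_distribution(T, iters=2000):
    """Power iteration on the lazy chain (I + M)/2, which shares M's fixed point."""
    n = len(next(iter(T.values())))
    M = [[sum(Tx[i][j] for Tx in T.values()) for j in range(n)] for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
        pi = [0.5 * a + 0.5 * b for a, b in zip(pi, step)]
    return pi

def statistical_complexity(T):
    """C_mu = H[S], the entropy of the stationary state distribution."""
    return entropy(stationary_distribution(T))

def entropy_rate(T):
    """h_mu = sum_s pi_s H[next symbol | s]; valid because the machine is unifilar."""
    pi = stationary_distribution(T)
    h = 0.0
    for s, weight in enumerate(pi):
        symbol_probs = [sum(Tx[s]) for Tx in T.values()]
        h += weight * entropy(symbol_probs)
    return h

T = forward_rip_machine(0.5, 0.5)
pi = stationary_distribution(T)
c_plus = statistical_complexity(T)
h_mu = entropy_rate(T)
print([round(x, 4) for x in pi], round(c_plus, 4), round(h_mu, 4))
```

Under these assumptions the stationary distribution agrees with $(1, p, 1)/(p+2)$, and at $p = q = 1/2$ the computed $C_\mu^+$ and $h_\mu$ agree with the quoted 1.5219 bits and 3/5 bits per measurement. The reverse machine is not reconstructed here because its transitions are given only in the figure.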

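Let's analyze its bidirectional machine, shown in Fig.
2(c) for $p = q = 1/2$. The interdependence between the
forward and reverse states is given by:

$$\Pr(\mathcal{S}^+, \mathcal{S}^-) = \frac{1}{p+2}
\begin{pmatrix}
0 & 1-p & 0 & p \\
0 & p(1-q) & pq & 0 \\
1 & 0 & 0 & 0
\end{pmatrix} ,$$

with rows indexed by $A, B, C$ and columns by $D, E, F, G$.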
By way of demonstrating the exact analysis now possible,
$\mathbf{E}$'s closed-form expression for the RIP family is

$$\mathbf{E} = \log_2(p+2) - \frac{p \log_2 p}{p+2} - \frac{1-pq}{p+2}\, H\!\left(\frac{1-p}{1-pq}\right) ,$$

where $H(\cdot)$ is the binary entropy function. The first two
terms on the RHS are $C_\mu^+$ and the last is $H[\mathcal{S}^+ | \mathcal{S}^-]$.

Setting $p = q = 1/2$, one calculates that $P(\mathcal{S}^\pm) =
P(AE, AG, BE, BF, CD) = (1/5, 1/5, 1/10, 1/10, 2/5)$.
This and the joint distribution give $C_\mu^\pm = H[\mathcal{S}^\pm] \approx
2.1219$ bits, but an $\mathbf{E} = I[\mathcal{S}^+; \mathcal{S}^-] \approx 1.2464$ bits. That
is, the excess entropy (the apparent information) is
substantially less than the statistical complexities (stored
information): a rather cryptic process, with $d \approx 0.8755$ bits.
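
These values can be reproduced directly from a joint distribution over forward and reverse causal states. The sketch below uses hypothetical function names and one labeled assumption: the pairings AE, AG, BE, BF, CD listed for $p = q = 1/2$ are taken to hold for all $0 < p, q < 1$, in which case the stated marginals fix every joint probability. The closed form in the code is reconstructed as $C_\mu^+ - H[\mathcal{S}^+ | \mathcal{S}^-]$ under that same assumption.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(x * log2(x) for x in probs if x > 0)

def rip_joint(p, q):
    """Joint P(S+, S-) for the RIP, assuming the support AE, AG, BE, BF, CD."""
    z = p + 2
    return {
        ("A", "E"): (1 - p) / z,
        ("A", "G"): p / z,
        ("B", "E"): p * (1 - q) / z,
        ("B", "F"): p * q / z,
        ("C", "D"): 1 / z,
    }

def marginal(joint, axis):
    """Marginal over forward states (axis=0) or reverse states (axis=1)."""
    out = {}
    for pair, pr in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + pr
    return out

def measures(joint):
    """Complexities, excess entropy, crypticity, and irreversibility in bits."""
    c_plus = entropy(marginal(joint, 0).values())
    c_minus = entropy(marginal(joint, 1).values())
    c_bi = entropy(joint.values())
    e = c_plus + c_minus - c_bi        # E = I[S+; S-]
    return {
        "C+": c_plus, "C-": c_minus, "C+-": c_bi, "E": e,
        "d": c_bi - e,                 # H[S+|S-] + H[S-|S+]
        "Xi": c_plus - c_minus,        # sign follows the convention Xi = C+ - C-
    }

def excess_entropy_closed_form(p, q):
    """E = C+ - H[S+|S-] for the joint above (a reconstruction, see lead-in)."""
    z = p + 2
    x = (1 - p) / (1 - p * q)          # Pr(S+ = A | S- = E)
    return log2(z) - p * log2(p) / z - (1 - p * q) / z * entropy([x, 1 - x])

m = measures(rip_joint(0.5, 0.5))
for name, value in m.items():
    print(name, round(value, 4))
```

With the convention $\Xi = C_\mu^+ - C_\mu^-$ the irreversibility comes out negative for the RIP; it agrees in magnitude with the 0.3245 bits quoted above. Only state `E` of the reverse machine leaves the forward state uncertain, which is why $H[\mathcal{S}^+ | \mathcal{S}^-]$ reduces to a single binary-entropy term.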

To close, the main results establish that when $d > 0$
one cannot simply use sequence information directly
to represent a process as storing $\mathbf{E}$ bits of information.
We must instead store $C_\mu^\pm$ bits of information, building
a causal model of the hidden state information. Why?
Because typical processes encrypt their state information
within their observed behavior. More precisely, observed
information can be arbitrarily small ($\mathbf{E} \approx 0$) compared
to the stored information ($C_\mu^\pm$).

In deriving an explicit relationship between excess
entropy and the ε-machine, the framework puts prediction
on an equal footing with modeling and so allows for a
direct comparison between them. Also, as we demonstrated
with the RIP example, it gives a way to develop
closed-form expressions for $\mathbf{E}$. Finally and most generally, it
reveals an intimate connection between unpredictability,
irreversibility, crypticity, and information storage.

Practically, the results clear up persistent confusions 
in several literatures that conflate observed (mutual) in- 
formation and a process's stored information. Analyzing 
a process only in terms of mutual information misses an 
arbitrarily large amount of a process's structure. When 
this happens, one concludes that a process is more ran- 
dom than it is and that it has little structure, when nei- 
ther is true. 